Abstract

We prove a strong law of large numbers for a class of strongly mixing processes. Our result rests on recent advances in the understanding of concentration of measure. It is simple to apply and gives finite-sample (as opposed to asymptotic) bounds, with readily computable rate constants. In particular, this makes it suitable for the analysis of inhomogeneous Markov processes. We demonstrate how it can be applied to establish an almost-sure convergence result for a class of models that includes as a special case a class of adaptive Markov chain Monte Carlo algorithms.

1 Introduction 



The strong laws of large numbers (SLLNs) play a fundamental role in statistics. They assert the convergence of empirical averages to true expectations and, under appropriate assumptions, ensure that inferences about persistent world phenomena become increasingly valid as data accumulate. The various laws of large numbers date back at least to the publication in 1713 of Jakob Bernoulli's Ars Conjectandi (Bernoulli, 1713), which stated an early form of the weak law of large numbers. Subsequent development of the concept was carried out by many others. Among the first of these was Siméon-Denis Poisson, who generalized Bernoulli's result and gave it its modern name, "la loi des grands nombres". Other notable mathematicians who made further contributions include Chebyshev, Markov, Borel, Cantelli, and Kolmogorov. Over the years, the various contributions gave rise to two common forms, the weak and strong laws of large numbers, establishing conditions, respectively, for weak and strong convergence of empirical averages.

The SLLN in its earlier forms applies to sequences of independent random variables. However, for dependent sequences the theory is not as well developed. Indeed, to quote Ninness (2000), "for non-iid sequences the required SLLN results do not seem to be readily available in the literature", and thus researchers continue to work to develop generalized strong laws of large numbers. Since around 1960, a number of different generalizations have been obtained. Our goal is not to give a comprehensive survey of these results; the interested reader may find a useful list of relevant papers at http://www.stats.org.uk/law-of-large-numbers/. The basic current state of affairs may be summarized as follows. From Birkhoff's ergodic theorem we get a law of large numbers for ergodic processes; this has been strengthened by Breiman (1960) to cover the case where the stationary distribution is singular with respect to the Lebesgue measure. Assumptions of ergodicity are typically too weak to provide a convergence rate; this requires a stronger mixing condition. A classical (and perhaps first of its kind) example of the latter is the paper by Hanson et al. (1963), which proves a strong law of large numbers under a mixing condition known in a modern form as $\psi$-mixing (see Bradley (2005)). This mixing condition guarantees exponentially rapid convergence, but the proof does not directly yield rate constants.

We approach the problem of developing such an SLLN with easily computable rate constants. This is useful, for example, in the context of simulation-based algorithms, if one is interested in determining how many iterations are required to achieve a specified accuracy. To this end, we assume a stronger (though still quite realistic) mixing condition from which we obtain strong laws of large numbers, along with finite-sample bounds, for sequences of random variables with arbitrary dependence. In a sense, the application of concentration of measure theory to establish an SLLN is straightforward, although an important technical contribution of this paper is to extend the deviation bounds proved in Kontorovich and Ramanan (2008); Kontorovich (2006b) for discrete measure spaces to the continuous case, as well as to bound certain mixing coefficients for adaptive Markov chains and to prove a strong law of large numbers for these (Theorem 5.7). We will state our results in a fairly general setting. We emphasize that the basic concentration result presented and employed here (Theorem 3.6) is by no means the only one available. The recent result of Chazottes et al. (2007) yields a very similar inequality; earlier results along these lines appeared in Marton (1998). The interested reader may refer to authoritative and comprehensive surveys of concentration techniques, such as Ledoux (2001), Schechtman (2003), and Lugosi (2003), or to Chatterjee (2005) and Kontorovich (2007) for more recent developments.

We mention in passing that the strong laws we consider here are not uniform laws, the latter asserting almost sure convergence uniformly over some permissible¹ (i.e., Glivenko-Cantelli) class of sets. There has been some work on such uniform laws for non-i.i.d. processes. In particular, Nobel and Dembo (1993) show that if a class of sets is a Glivenko-Cantelli class for an i.i.d. process, then the corresponding uniform strong law also holds for a $\beta$-mixing (absolutely regular) process. In the other direction, Nobel gives a counterexample where the uniform strong law fails for a stationary ergodic process. These two papers give ample background and provide illuminating discussions; they are an excellent starting point for anyone wishing to delve deeper into the topic.

This paper is organized as follows. In Section 2 we set down the notation and definitions used throughout the paper. This is followed by Section 3, where we describe the martingale method as our main workhorse for proving concentration of measure results. Our main law of large numbers is stated in Section 4. This strong law of large numbers is applied to adaptive Markov chains in Section 5. Finally, some technical lemmas are deferred to the Appendix.



2 Notation and Definitions 



Let $(\Omega^n, \mathcal{F}^n, P)$ be a probability space, where $\mathcal{F}^n$ is the usual Borel $\sigma$-algebra generated by the finite-dimensional cylinders. On this space define the random process $(X_i)_{1 \le i \le n}$, $X_i \in \Omega$. Throughout this paper, we assume $P \ll \mu^n$, for some positive Borel product measure $\mu^n = \mu \otimes \mu \otimes \cdots \otimes \mu$ on $(\Omega^n, \mathcal{F}^n)$.

We use the indicator variable $\mathbb{1}_{\{A\}}(\omega)$, equal to 1 if $\omega \in A$ and 0 otherwise, for $\omega \in \Omega^n$. For brevity, we will occasionally omit the argument $\omega$. The ramp function is defined by $(z)_+ = z\,\mathbb{1}_{\{z > 0\}}$. For arbitrary $\Omega$, we define the Hamming metric on $\Omega^n$:

$$d_{\mathrm{Ham}}(x, y) = \sum_{i=1}^{n} \mathbb{1}_{\{x_i \ne y_i\}}. \tag{1}$$

¹ Following Pollard (1984), the term permissible is used to avoid measure-theoretic pathologies associated with taking suprema over uncountable collections of sets.



We need to introduce the notion of $\eta$-mixing, defined in Kontorovich (2006c).² We note that this type of mixing is by no means new; it can be traced (at least implicitly) to Marton (1998) and is quite explicit in Samson (2000) and Chazottes et al. (2007). For $1 \le i < j \le n$ and $x \in \Omega^i$, let

$$\mathcal{L}(X_{j:n} \mid X_{1:i} = x)$$

be the law (distribution) of $X_{j:n}$ conditioned on $X_{1:i} = x$. For $y \in \Omega^{i-1}$ and $w, w' \in \Omega$, define

$$\eta_{ij}(y, w, w') = \left\| \mathcal{L}(X_{j:n} \mid X_{1:i} = (y, w)) - \mathcal{L}(X_{j:n} \mid X_{1:i} = (y, w')) \right\|_{TV}, \tag{2}$$

where for signed measures $\nu$, $\|\nu\|_{TV} = [\sup_A \nu(A) - \inf_A \nu(A)]/2$ is the total variation norm, and

$$\bar\eta_{ij} = \operatorname*{ess\,sup}_{y, w, w'} \eta_{ij}(y, w, w'),$$

the essential supremum being taken with respect to $\mu$.

Remark 2.1. The definition in (2) is flawed as stated, as it does not adequately handle the case where sets of measure zero are conditioned on. A more precise definition is given in Kontorovich (2008). Taking $\mathcal{P}_+^n(\Omega)$ to be the set of all strictly positive probability measures $\mu$ on $\Omega^n$ (i.e., $\mu(x) > 0$ for all $x \in \Omega^n$), it is not hard to show that the functional $\bar\eta_{ij} : \mathcal{P}_+^n(\Omega) \to \mathbb{R}$ is continuous on $\mathcal{P}_+^n(\Omega)$ with respect to $\|\cdot\|_{TV}$. However, this continuity can break down on the boundary of $\mathcal{P}_+^n(\Omega)$. Thus, we may arbitrarily define $\bar\eta_{ij}$ on any probability measure $\mu$ by

$$\bar\eta_{ij}(\mu) = \inf_{\{\mu_k\}} \lim_{k} \bar\eta_{ij}(\mu_k), \tag{3}$$

where the infimum is taken over all sequences $\{\mu_k : \mu_k \in \mathcal{P}_+^n(\Omega),\ \|\mu_k - \mu\|_{TV} \to 0\}$.

See Section 5.4 of Kontorovich (2007) for a discussion of the continuity of $\bar\eta$ and conditioning on sets of measure zero, as well as motivation for the definition in (3).



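Let $\Delta_n$ be the upper-triangular $n \times n$ matrix defined by $(\Delta_n)_{ii} = 1$ and

$$(\Delta_n)_{ij} = \bar\eta_{ij} \tag{4}$$

for $1 \le i < j \le n$. Recall that the $\ell_\infty$ operator norm is given by

$$\|\Delta_n\|_\infty = \max_{1 \le i < n} \left(1 + \bar\eta_{i,i+1} + \cdots + \bar\eta_{i,n}\right). \tag{5}$$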
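As a small illustration of (4) and (5), the following sketch computes $\|\Delta_n\|_\infty$ from a table of mixing coefficients. The function names and the geometric example $\bar\eta_{ij} = \theta^{j-i}$ are assumptions made here for illustration; they are not part of the original text.

```python
def delta_inf_norm(eta_bar):
    """Return the l-infinity operator norm (maximum row sum) of Delta_n.

    eta_bar is an n x n list of lists; only entries with i < j are read,
    and they are assumed to hold mixing coefficients in [0, 1].
    """
    n = len(eta_bar)
    row_sums = []
    for i in range(n):
        # The diagonal entry is 1; entries below the diagonal are 0.
        row_sums.append(1.0 + sum(eta_bar[i][j] for j in range(i + 1, n)))
    return max(row_sums)


def geometric_eta(n, theta):
    """Hypothetical example: eta_bar[i][j] = theta ** (j - i) for i < j."""
    return [[theta ** (j - i) if j > i else 0.0 for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    print(delta_inf_norm(geometric_eta(4, 0.5)))  # 1 + 0.5 + 0.25 + 0.125
    print(delta_inf_norm(geometric_eta(4, 0.0)))  # no dependence: norm is 1
```

With geometrically decaying coefficients the row sums are partial sums of a geometric series, so the norm stays below $1/(1-\theta)$ for every $n$.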



² This notion of mixing is distinct from, and not to be confused with, the $\eta$-weak dependence of Doukhan and Louhichi (1999).

We will occasionally want to make the dependence of $\Delta_n$ on the probability measure $P$ explicit; to do this we will write $\Delta_n(P)$. Let us collect some simple observations about $\|\Delta_n(P)\|_\infty$:

Lemma 2.2. Let $P \ll \mu^n$ be a probability measure on $(\Omega^n, \mathcal{F}^n)$. Then

(a) $1 \le \|\Delta_n(P)\|_\infty \le n$

(b) $\|\Delta_n(P)\|_\infty = 1$ iff $P$ is (equivalent a.e. $[\mu^n]$ to) a product measure

(c) if $Q \ll \mu^m$ is a probability measure on $\Omega^m$ then

$$\|\Delta_{n+m}(P \otimes Q)\|_\infty \le \max\{\|\Delta_n(P)\|_\infty, \|\Delta_m(Q)\|_\infty\}.$$



These properties are established in Kontorovich (2007), which also gives a discussion of the relationship between $\eta$-mixing and $\phi$- and other kinds of mixing.

3 Concentration via Martingale Differences 

Recall our probability space $(\Omega^n, \mathcal{F}^n, P)$ and let $\mathcal{F}_i$ be the $\sigma$-algebra generated by $(X_1, \ldots, X_i)$, which induces the filtration

$$\{\emptyset, \Omega^n\} = \mathcal{F}_0 \subset \mathcal{F}_1 \subset \cdots \subset \mathcal{F}_n = \mathcal{F}^n. \tag{6}$$

For $i = 1, \ldots, n$ and $f \in L_1(\Omega^n, P)$, define the martingale difference

$$V_i = \mathbb{E}[f \mid \mathcal{F}_i] - \mathbb{E}[f \mid \mathcal{F}_{i-1}]. \tag{7}$$

It is a classical result³, going back to Azuma (1967), that

$$P\{|f - \mathbb{E}f| > t\} \le 2\exp\left(-\frac{t^2}{2\sum_{i=1}^{n}\|V_i\|_\infty^2}\right). \tag{8}$$

Thus, if we are able to uniformly bound the martingale difference,

$$\max_{1 \le i \le n} \|V_i\|_\infty \le H_n,$$

we obtain the concentration inequality

$$P\{|f - \mathbb{E}f| > t\} \le 2\exp\left(-\frac{t^2}{2nH_n^2}\right). \tag{9}$$

Recall our assumption that $dP(x) = p(x)\,d\mu^n(x)$ for some positive Borel product measure $\mu^n = \mu \otimes \mu \otimes \cdots \otimes \mu$ on $(\Omega^n, \mathcal{F}^n)$. Similarly, the conditional probability satisfies $P(\cdot \mid \mathcal{F}_i) \ll \mu^{n-i}$, with density $p(\cdot \mid X_{1:i} = x_{1:i})$. Here and below $p(x_{j:n} \mid x_{1:i})$ will occasionally be used in place of $p(x_{j:n} \mid X_{1:i} = x_{1:i})$; no ambiguity should arise.

³ See Ledoux (2001) for a modern presentation and a short proof of (8).

For $f \in L_1(\Omega^n, P)$, $1 \le i \le n$ and $y_{1:i} \in \Omega^i$, define

$$V_i(f; y_{1:i}) = \mathbb{E}[f(X) \mid X_{1:i} = y_{1:i}] - \mathbb{E}[f(X) \mid X_{1:i-1} = y_{1:i-1}]; \tag{10}$$

this is just the martingale difference.

A slightly more tractable quantity turns out to be

$$\hat V_i(f; y_{1:i-1}, w_i, w_i') = \mathbb{E}[f(X) \mid X_{1:i} = (y_{1:i-1}, w_i)] - \mathbb{E}[f(X) \mid X_{1:i} = (y_{1:i-1}, w_i')], \tag{11}$$

where $w_i, w_i' \in \Omega$. These two quantities have a simple relationship, which may be stated symbolically as

$$\|V_i(f; \cdot)\|_{L_\infty(P)} \le \|\hat V_i(f; \cdot)\|_{L_\infty(P)}; \tag{12}$$

this is proved in (Kontorovich, 2006d, Lemma 4.1).

The next step is to notice that $\hat V_i(\cdot\,; y_{1:i-1}, w_i, w_i')$, as a functional on $L_1(\Omega^n, P)$, is linear; in fact, it is given by

$$\hat V_i(f; y_{1:i-1}, w_i, w_i') = \int_{\Omega^n} f(x) g(x)\, d\mu^n(x) = \langle f, g \rangle, \tag{13}$$

where

$$g(x) = \mathbb{1}_{\{x_{1:i} = (y_{1:i-1}, w_i)\}}\, p(x_{i+1:n} \mid (y_{1:i-1}, w_i)) - \mathbb{1}_{\{x_{1:i} = (y_{1:i-1}, w_i')\}}\, p(x_{i+1:n} \mid (y_{1:i-1}, w_i')). \tag{14}$$

The plan is to bound $\langle f, g \rangle$ using continuity properties of $f$ and mixing properties of $X$, which will immediately lead to a result of type (9) via (12).

3.1 $\Phi$ and $\Psi$ Norms

To state our results in sufficient generality, we shall borrow several definitions from Kontorovich (2007). Let $(\mathcal{X}, \rho)$ be a metric space and recall the definition of the Lipschitz constant of an $f : \mathcal{X} \to \mathbb{R}$:

$$\|f\|_{\mathrm{Lip}} = \sup_{x \ne y} \frac{|f(x) - f(y)|}{\rho(x, y)}. \tag{15}$$

Recall also the definition of the diameter:

$$\operatorname{diam}_\rho(\mathcal{X}) = \sup_{x, y \in \mathcal{X}} \rho(x, y).$$

Let $\mu$ be a positive Borel measure on a measurable space $(\Omega, \mathcal{F})$ and let $F_n = L_1(\Omega^n, \mathcal{F}^n, \mu^n)$ be equipped with the inner product

$$\langle f, g \rangle = \int_{\Omega^n} f(x) g(x)\, d\mu^n(x). \tag{16}$$

Since $f, g \in F_n$ might not be in $L_2(\Omega^n, \mathcal{F}^n, \mu^n)$, the expression in (16) in general might not be finite. However, for $g \in L_\infty(\Omega^n, \mathcal{F}^n, \mu^n)$, we have

$$|\langle f, g \rangle| \le \|f\|_{L_1} \|g\|_{L_\infty}. \tag{17}$$

Define the projection operator $\pi : F_n \to F_{n-1}$ as follows. If $f : \Omega^n \to \mathbb{R}$ then $(\pi f) : \Omega^{n-1} \to \mathbb{R}$ is given by

$$(\pi f)(x_2, \ldots, x_n) = \int_{\Omega} f(x_1, x_2, \ldots, x_n)\, d\mu(x_1). \tag{18}$$

Note that by Fubini's theorem (Thm. 8.8(c) in Rudin (1987)), $\pi f \in L_1(\Omega^{n-1}, \mathcal{F}^{n-1}, \mu^{n-1})$. Define the functional $\Psi_n : F_n \to \mathbb{R}$ recursively: $\Psi_0 = 0$ and

$$\Psi_n(f) = \Psi_{n-1}(\pi f) + \int_{\Omega^n} (f(x))_+\, d\mu^n(x) \tag{19}$$

for $n \ge 1$. The latter is finite since

$$\Psi_n(f) \le n \|f\|_{L_1}, \tag{20}$$

as shown in the following Lemma.
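The recursion (19) is easy to evaluate when $\Omega$ is finite and $\mu$ is counting measure. The sketch below is only an illustration under that assumption: the dictionary representation of $f$ and the names `psi` and `l1_norm` are choices made here, not notation from the paper.

```python
def psi(f, n):
    """Psi_n(f) for a finite state space with counting measure.

    f maps n-tuples of states to real numbers; tuples that are absent
    are treated as having value 0.
    """
    if n == 0:
        return 0.0  # Psi_0 = 0
    # Integral of the ramp function (f)_+ with respect to counting measure.
    positive_part = sum(max(v, 0.0) for v in f.values())
    # Projection pi: integrate (here, sum) out the first coordinate.
    projected = {}
    for x, v in f.items():
        projected[x[1:]] = projected.get(x[1:], 0.0) + v
    return psi(projected, n - 1) + positive_part


def l1_norm(f):
    """L_1 norm of f with respect to counting measure."""
    return sum(abs(v) for v in f.values())


if __name__ == "__main__":
    f = {(0, 0): 1.0, (0, 1): -1.0, (1, 0): -1.0, (1, 1): 1.0}
    print(psi(f, 2), 2 * l1_norm(f))  # Psi_2(f) and the bound n * ||f||_1
```

In this example the projection $\pi f$ vanishes identically, so $\Psi_2(f)$ reduces to the integral of $(f)_+$, which sits well below the bound in (20).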



Let $\Phi_n \subset F_n$ be the set of all measurable⁴ $f : \Omega^n \to [0, \operatorname{diam}_\rho(\Omega^n)]$ with $\|f\|_{\mathrm{Lip}} \le 1$ and define two norms on $F_n$:

$$\|f\|_\Phi = \sup_{g \in \Phi_n} |\langle f, g \rangle| \tag{21}$$

and

$$\|f\|_\Psi = \max_{s \in \{-1, 1\}} \Psi_n(sf). \tag{22}$$

We refer to the norms in (21) and (22) as the $\Phi$-norm and the $\Psi$-norm, respectively, and summarize some of their properties:

Lemma 3.1. Under mild measure-theoretic regularity conditions on $(\Omega^n, \mathcal{F}^n, \mu^n)$, which cover the case of $\Omega$ countable and $\Omega = \mathbb{R}$ with Lebesgue measure (the metric being $\rho = d_{\mathrm{Ham}}$ in either case), the functionals defined in (21) and (22) satisfy

(a) the $\Phi$-norm and the $\Psi$-norm are valid vector-space norms on $F_n$

(b) for all $f \in F_n$,

$$\tfrac{1}{2}\|f\|_{L_1} \le \|f\|_\Phi \le n\|f\|_{L_1}$$

(c) for all $f \in F_n$,

$$\tfrac{1}{2}\|f\|_{L_1} \le \|f\|_\Psi \le n\|f\|_{L_1}.$$

⁴ Note that $\|f\|_{\mathrm{Lip}} \le 1$ does not guarantee that $f$ is $\mathcal{F}^n$-measurable, so the requirement that $\Phi_n \subset F_n$ is essential.

Proof. The claim in (a) is proved in (Kontorovich, 2006d, Theorems A.1, A.2, A.3); (b) and (c) are proved ibid., in (52) and Theorem A.1(b), respectively. □

Definition 3.2. A metric space $(\Omega^n, \rho)$ is said to be $\Psi$-dominated with respect to a positive Borel measure $\mu$ on $\Omega$ if the inequality

$$\sup_{g \in \Phi_n} \langle f, g \rangle \le \Psi_n(f) \tag{23}$$

holds for all $f \in F_n$.

Remark 3.3. We have defined $\Phi_n$, $\Psi_n(\cdot)$, and $\langle \cdot, \cdot \rangle$ in an abstract measure space. To indicate explicitly what space they are being defined over, we will use the notation $\Phi_{\Omega^n,\mu^n}$, $\Psi_{\Omega^n,\mu^n}(\cdot)$, and $\langle \cdot, \cdot \rangle_{\Omega^n,\mu^n}$.

When $\Omega$ is a finite set with counting measure $\nu$, and $\rho = d_{\mathrm{Ham}}$, we have

$$\langle f, g \rangle_{\Omega^n,\nu^n} \le \Psi_{\Omega^n,\nu^n}(f) \tag{24}$$

for all $f : \Omega^n \to \mathbb{R}$ and $g \in \Phi_{\Omega^n,\nu^n}$; in other words, $(\Omega^n, d_{\mathrm{Ham}})$ is $\Psi$-dominated with respect to $\nu$. This was first proved in Kontorovich and Ramanan (2008); see Kontorovich (2006a) for a much simpler proof. Kontorovich (2007) extends this to countable $\Omega$. The goal of the remainder of this section is to establish the analogue of (24) for $\Omega = \mathbb{R}$ with the Lebesgue measure $\mu$.



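Theorem 3.4. Let $\mu^n$ be the Lebesgue measure on $\mathbb{R}^n$ and take $\rho = d_{\mathrm{Ham}}$. Then $(\mathbb{R}^n, \rho)$ is $\Psi$-dominated. That is, for all $f \in L_1(\mathbb{R}^n, \mu^n)$ and $g \in \Phi_{\mathbb{R}^n,\mu^n}$, we have

$$\langle f, g \rangle_{\mathbb{R}^n,\mu^n} \le \Psi_{\mathbb{R}^n,\mu^n}(f). \tag{25}$$

Remark 3.5. Observe that (25) is equivalent to

$$\|f\|_\Phi \le \|f\|_\Psi, \qquad f \in L_1(\mathbb{R}^n, \mu^n). \tag{26}$$

The proof will closely follow the argument in Kontorovich (2006c, Theorem 8.1).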



Proof. Let $C_c$ denote the space of continuous functions $f : \mathbb{R}^n \to \mathbb{R}$ with compact support; it follows from (Rudin, 1987, Theorem 3.14) that $C_c$ is dense in $L_1(\mathbb{R}^n, \mu^n)$, in the topology induced by $\|\cdot\|_{L_1}$. This implies that for any $f \in L_1(\mathbb{R}^n, \mu^n)$ and $\varepsilon > 0$, there is a $g \in C_c$ such that $\|f - g\|_{L_1} < \varepsilon/n$ and therefore (via Lemma 3.1(b) and (c)),

$$\|f - g\|_\Phi < \varepsilon \quad \text{and} \quad \|f - g\|_\Psi < \varepsilon,$$

so it suffices to prove (26) for $f \in C_c$.

For $m \in \mathbb{N}$, define $Q_m \subset \mathbb{Q}$ to be the rational numbers with denominator $m$:

$$Q_m = \{p/r \in \mathbb{Q} : r = m\}.$$

Define the map $\gamma_m : \mathbb{R} \to Q_m$ by

$$\gamma_m(x) = \max\{q \in Q_m : q \le x\}$$

and extend it to $\gamma_m : \mathbb{R}^n \to Q_m^n$ by defining $[\gamma_m(x)]_i = \gamma_m(x_i)$. The set $Q_m^n \subset \mathbb{R}^n$ will be referred to as the $m$-grid points.

We say that $g \in L_1(\mathbb{R}^n, \mu^n)$ is a grid-constant function if there is an $m \ge 1$ such that $g(x) = g(y)$ whenever $\gamma_m(x) = \gamma_m(y)$; thus a grid-constant function is constant on the grid cells induced by $Q_m$. Let $G_c$ be the space of the grid-constant functions with compact support; note that $G_c \subset L_1(\mathbb{R}^n, \mu^n)$. It is easy to see that $G_c$ is dense in $C_c$. Indeed, pick any $f \in C_c$ and let $M \in \mathbb{N}$ be such that $\operatorname{supp}(f) \subset [-M, M]^n$. Now a continuous function is uniformly continuous on a compact set, and so for any $\varepsilon > 0$, there is a $\delta > 0$ such that $\omega_f(\delta) < \varepsilon/(2M)^n$, where $\omega_f$ is the $\ell_\infty$ modulus of continuity of $f$. Take $m = \lceil 1/\delta \rceil$ and let $g \in G_c$ be such that $\operatorname{supp}(g) \subset [-M, M]^n$ and $g$ agrees with $f$ on the $m$-grid points. Then we have

$$\|f - g\|_{L_1} \le (2M)^n \|f - g\|_{L_\infty} < \varepsilon.$$

Thus we need only prove (25) for $f \in G_c$, $g \in G_c \cap \Phi_{\mathbb{R}^n,\mu^n}$.

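Let $f \in G_c$ and $g \in G_c \cap \Phi_{\mathbb{R}^n,\mu^n}$ be fixed, and let $m \ge 1$ be such that $f$ and $g$ are $m$-grid-constant functions. Let $\kappa, \varphi : Q_m^n \to \mathbb{R}$ be such that $\kappa(\gamma_m(x)) = f(x)$ and $\varphi(\gamma_m(x)) = g(x)$ for all $x \in \mathbb{R}^n$. As above, choose $M \in \mathbb{N}$ so that $\operatorname{supp}(f) \cup \operatorname{supp}(g) \subset [-M, M]^n$. Then, denoting the counting measure on $Q_m^n$ by $\nu^n$ and noting that each grid cell has Lebesgue measure $m^{-n}$, we have

$$\langle f, g \rangle_{\mathbb{R}^n,\mu^n} = m^{-n} \langle \kappa, \varphi \rangle_{Q_m^n,\nu^n}$$

and

$$\Psi_{\mathbb{R}^n,\mu^n}(f) = m^{-n}\, \Psi_{Q_m^n,\nu^n}(\kappa).$$

Now only the finitely many grid points in $[-M, M]^n$ contribute, and by construction $\varphi \in \Phi_{Q_m^n,\nu^n}$, so (24) applies. This shows $\langle f, g \rangle_{\mathbb{R}^n,\mu^n} \le \Psi_{\mathbb{R}^n,\mu^n}(f)$ and completes the proof. □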



3.2 Bounding the martingale difference 

The machinery of $\eta$-mixing and $\Psi$-dominance allows us to bound the martingale difference for a Lipschitz function of arbitrarily dependent random variables.

Theorem 3.6. Let $(\Omega, \mathcal{F}, \mu)$ be a positive Borel measure space and suppose that $(\Omega^n, \rho)$ is $\Psi$-dominated with respect to $\mu$ for some metric $\rho$ on $\Omega^n$. Let $(\Omega^n, \mathcal{F}^n, P)$ be a probability space with $P \ll \mu^n$. Then for $1 \le i \le n$,

$$\|V_i(f; \cdot)\|_{L_\infty(P)} \le \|f\|_{\mathrm{Lip}} \|\Delta_n(P)\|_\infty, \tag{27}$$

where $V_i$ is the martingale difference defined in (10) and $\Delta_n$ is the $\eta$-mixing matrix defined in (4).

Remark 3.7. This result can be proved, almost verbatim, by the argument given in Kontorovich (2006d, Theorem 7.1), so we only give a sketch of the proof here.

Proof. Since $\|V_i(f; \cdot)\|_{L_\infty(P)}$ and $\|f\|_{\mathrm{Lip}}$ are both homogeneous functionals of $f$ (in the sense of $T(af) = |a| T(f)$ for $a \in \mathbb{R}$), there is no loss of generality in taking $\|f\|_{\mathrm{Lip}} = 1$. Additionally, since $V_i(f; y)$ is translation-invariant (in the sense that $V_i(f; y) = V_i(f + a; y)$ for all $a \in \mathbb{R}$), there is no loss of generality in restricting the range of $f$ to $[0, n]$. In other words, it suffices to consider $f \in \Phi_{\Omega^n,\mu^n}$.

From (12) we have that it suffices to bound $\|\hat V_i(f; \cdot)\|_{L_\infty(P)}$, with $\hat V_i$ defined in (11). By (13), we have the equivalent form

$$\hat V_i(f; y_{1:i-1}, w_i, w_i') = \int_{\Omega^n} f(x) g_i(x)\, d\mu^n(x) = \langle f, g_i \rangle_{\Omega^n,\mu^n}, \tag{28}$$

where $g_i$ has a simple explicit construction (14), depending on $y_{1:i-1}, w_i, w_i'$.

It is shown in the course of proving (Kontorovich, 2006d, Theorem 7.1) that

$$\langle f, g_i \rangle = \langle T_y f, T_y g_i \rangle,$$

where the operator $T_y : L_1(\Omega^n, \mu^n) \to L_1(\Omega^{n-i+1}, \mu^{n-i+1})$ is defined by

$$(T_y h)(x) = h(yx), \quad \text{for all } x \in \Omega^{n-i+1}.$$

Appealing to Theorem 3.4 we get

$$\langle T_y f, T_y g_i \rangle \le \Psi_n(T_y g_i). \tag{29}$$

Furthermore, as shown ibid., the form of $g_i$ implies that

$$\Psi_n(T_y g_i) \le 1 + \sum_{j=i+1}^{n} \bar\eta_{ij}, \tag{30}$$

establishing (27). □



4 The Strong Law of Large Numbers 

We are now in a position to state our main result. 

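Theorem 4.1. Let $(\Omega, \mathcal{F}, \mu)$ be a positive Borel measure space and suppose that $(\Omega^n, d_{\mathrm{Ham}})$ is $\Psi$-dominated with respect to $\mu$. Define the random process $X_{1:\infty}$ on the measure space $(\Omega^{\mathbb{N}}, \mathcal{F}^{\mathbb{N}}, P)$ and assume that for all $n \ge 1$ we have $P_n \ll \mu^n$, where $P_n$ is the marginal distribution on $X_{1:n}$ and $\mu^n$ is the corresponding product measure on $(\Omega^n, \mathcal{F}^n)$. Suppose further that the empirical measure defined by

$$\hat P_n(A) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{\{X_i \in A\}}, \qquad A \in \mathcal{F}, \tag{31}$$

has uniformly converging expectation:

$$\lim_{n \to \infty} \left\| \mathbb{E}\hat P_n(\cdot) - \nu(\cdot) \right\|_{TV} = 0,$$

and define $n_0 = n_0(\varepsilon)$ to be such that $\|\mathbb{E}\hat P_n(\cdot) - \nu(\cdot)\|_{TV} < \varepsilon$ for all $n > n_0(\varepsilon)$.

Then $\hat P_n(A)$ converges to $\nu(A)$ almost surely, exponentially fast:

$$P_n\left\{ |\hat P_n(A) - \nu(A)| > t + \varepsilon \right\} \le 2\exp\left(-\frac{n t^2}{2\|\Delta_n\|_\infty^2}\right) \tag{32}$$

for all $n > n_0(\varepsilon)$, where $\Delta_n$ is the $\eta$-mixing matrix defined in equation (4).

Particular cases of interest include countable $\Omega$ with $\mu$ taken to be counting measure and $\Omega = \mathbb{R}$, with $\mu$ taken to be Lebesgue measure.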



Proof. This result follows directly from Theorem 3.6 by observing that the function $\varphi_A : \Omega^n \to \mathbb{R}$ defined by $\varphi_A(X_{1:n}) = \hat P_n(A)$ has Lipschitz constant $1/n$ with respect to $d_{\mathrm{Ham}}$. □
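The Lipschitz claim used in this proof can be checked by brute force on a small finite state space. The following sketch is purely illustrative; the state space, the set $A$, and all function names are assumptions chosen here.

```python
from itertools import product


def hamming(x, y):
    """Hamming metric (1): number of coordinates in which x and y differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


def empirical_measure(x, A):
    """Empirical measure (31) of the set A along the path x."""
    return sum(1 for xi in x if xi in A) / len(x)


def lipschitz_constant(func, points):
    """Largest ratio |func(x) - func(y)| / d_Ham(x, y) over distinct points."""
    best = 0.0
    for x in points:
        for y in points:
            d = hamming(x, y)
            if d > 0:
                best = max(best, abs(func(x) - func(y)) / d)
    return best


if __name__ == "__main__":
    n = 4
    states = (0, 1, 2)          # hypothetical finite state space
    A = {0, 2}                  # hypothetical event
    points = list(product(states, repeat=n))
    print(lipschitz_constant(lambda x: empirical_measure(x, A), points))
```

Changing one coordinate of the path moves at most one indicator in (31), which is why the ratio never exceeds $1/n$.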



Corollary 4.2. Under the conditions of Theorem 4.1, $\hat P_n$ converges to $\nu$ in distribution, almost surely.

Proof. This is an immediate consequence of the (first) Borel-Cantelli lemma. □
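Because the bound (32) is explicit, it can be inverted to estimate how many samples are needed for a given accuracy. The sketch below assumes that a single constant `delta_norm` bounds $\|\Delta_n\|_\infty$ for every $n$ (as happens, for instance, when the mixing coefficients decay geometrically) and that $n > n_0(\varepsilon)$; the function names are hypothetical.

```python
import math


def slln_tail_bound(n, t, delta_norm):
    """Right-hand side of (32), capped at 1 since it bounds a probability."""
    return min(1.0, 2.0 * math.exp(-n * t * t / (2.0 * delta_norm ** 2)))


def samples_needed(t, prob, delta_norm):
    """Smallest n for which the right-hand side of (32) is at most prob.

    Assumes delta_norm is an upper bound on ||Delta_n||_inf valid for all n.
    """
    return math.ceil(2.0 * delta_norm ** 2 * math.log(2.0 / prob) / t ** 2)


if __name__ == "__main__":
    n = samples_needed(t=0.05, prob=0.01, delta_norm=2.0)
    print(n, slln_tail_bound(n, 0.05, 2.0))
```

The required sample size grows quadratically in the mixing constant and in $1/t$, but only logarithmically in the inverse of the target probability.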



5 Concentration of Marginals of Markov chains 

For clarity of presentation, we take all the state spaces to be finite, until indicated otherwise in Section 5.4. Everything extends easily to the continuous case, as shown in the Appendix.



5.1 Bounding $\bar\eta_{ij}$ via contraction coefficients

Consider a Markov process

$$(W_t)_{t = 1, 2, \ldots}$$

taking values $W_t = (X_t, Y_t) \in \Omega_o \times \Omega_h$, defined on the probability space $((\Omega_o \times \Omega_h)^{\mathbb{N}}, (\mathcal{F}_o \otimes \mathcal{F}_h)^{\mathbb{N}}, P)$; the subscripts 'o' and 'h' are used to suggest "observed" and "hidden" states. Suppose that we are interested primarily in the marginal behaviour of $(X_t)$. (This might be the case, for instance, if we can observe $(X_t)$ but not $(Y_t)$, as when analyzing hidden Markov models. It is also of interest in the context of analysis of adaptive Markov chain Monte Carlo schemes as described later in this section.) Let us write the transition kernel of $(W_t)$ as

$$K_t(w, A) = P(W_{t+1} \in A \mid W_t = w), \qquad w \in \Omega_o \times \Omega_h, \quad A \in \mathcal{F}_o \otimes \mathcal{F}_h,$$

and assume that the initial distribution of the process is

$$p_0(A) = P(W_1 \in A).$$



When $(X_t)$ can be embedded into a higher-dimensional Markov process as stated above, we will call $(X_t)$ a Markov marginal chain (MMC). Note that Markov marginal chains properly contain the ordinary Markov chains, and are easily seen to be equivalent in expressive power to hidden Markov chains.

In this section we apply the results established in the previous sections to obtain relatively simple 
conditions for convergence of empirical measures associated with the MMC. 

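Let us define the $i$th contraction coefficient $\theta_i$ of the MMC defined above by

$$\theta_i = \sup_{w, w' \in \Omega_o \times \Omega_h} \left\| K_i(w, \cdot) - K_i(w', \cdot) \right\|_{TV}. \tag{33}$$

We obtain a bound on the $\eta$-mixing coefficients of the MMC in terms of its contraction coefficients:

Theorem 5.1. The MMC $(X_t)$ on $\Omega_o^n$, as defined above, satisfies

$$\bar\eta_{ij} \le \theta_i \theta_{i+1} \cdots \theta_{j-1}. \tag{34}$$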



Proof. The proof is greatly simplified by an observation of Marton (2007), namely, that a Markov marginal chain is a special case of a hidden Markov chain (see Rabiner (1989)). This is easily seen by considering the function

$$\pi : \Omega_o \times \Omega_h \to \Omega_o,$$

which projects an (observed, hidden) pair onto its observed component, mapping a Markov chain to its hidden Markov marginal. It has already been shown (see Kontorovich (2006b) or Kontorovich and Ramanan (2008)) that the $\eta$-mixing coefficients of a hidden Markov chain are controlled by the contraction coefficients of the underlying Markov chain, in the manner of (34). □
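For finite state spaces, (33) and (34) translate directly into a computation. The sketch below assumes kernels given as row-stochastic lists of lists, with row index playing the role of the current (observed, hidden) state; the function names and the example kernel are illustrative only.

```python
def tv(p, q):
    """Total variation distance between two probability vectors."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def contraction(kernel):
    """Contraction coefficient (33): largest TV distance between two rows."""
    return max(tv(r, s) for r in kernel for s in kernel)


def eta_bounds(thetas):
    """Upper bounds from (34).

    thetas[k] holds theta_{k+1}; the result is an n x n matrix (n = len + 1)
    whose (i, j) entry, for i < j in 0-based indexing, is the product
    thetas[i] * ... * thetas[j - 1].
    """
    n = len(thetas) + 1
    bound = [[0.0] * n for _ in range(n)]
    for i in range(n):
        running = 1.0
        for j in range(i + 1, n):
            running *= thetas[j - 1]
            bound[i][j] = running
    return bound


def delta_norm_bound(thetas):
    """Upper bound on ||Delta_n||_inf obtained by summing rows of eta_bounds."""
    return max(1.0 + sum(row) for row in eta_bounds(thetas))


if __name__ == "__main__":
    K = [[0.9, 0.1], [0.2, 0.8]]      # hypothetical two-state kernel
    theta = contraction(K)
    print(theta, delta_norm_bound([theta, theta]))
```

When every $\theta_i$ is at most some $\theta < 1$, the row sums are dominated by a geometric series, giving $\|\Delta_n\|_\infty \le 1/(1-\theta)$ uniformly in $n$.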



5.2 Tensorization lemma 



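Lemma 5.2. Suppose we have an MMC on $(\Omega_o \times \Omega_h)^n$ defined by transition kernels $\{K_i(\cdot \mid \cdot)\}_{1 \le i < n}$, which have the following special structure:

$$K_i\big((u^o, u^h) \mid (v^o, v^h)\big) = A_i(u^o \mid v^o, v^h)\, B_i(u^h \mid v^o, v^h). \tag{35}$$

Then we have

$$\theta_i \le \alpha_i + \beta_i - \alpha_i \beta_i, \tag{36}$$

where $\theta_i$ is defined in (33), and

$$\alpha_i = \max_{x^o, y^o \in \Omega_o}\ \max_{x^h, y^h \in \Omega_h} \left\| A_i(\cdot \mid x^o, x^h) - A_i(\cdot \mid y^o, y^h) \right\|_{TV},$$

$$\beta_i = \max_{x^o, y^o \in \Omega_o}\ \max_{x^h, y^h \in \Omega_h} \left\| B_i(\cdot \mid x^o, x^h) - B_i(\cdot \mid y^o, y^h) \right\|_{TV}.$$

Proof. The claim follows immediately from the well-known total variation tensorization lemma (see for instance Kontorovich (2007)), which states that if $\mu, \mu'$ are probability measures on $\mathcal{X}$ and $\nu, \nu'$ are probability measures on $\mathcal{Y}$, then

$$\|\mu \otimes \nu - \mu' \otimes \nu'\|_{TV} \le \|\mu - \mu'\|_{TV} + \|\nu - \nu'\|_{TV} - \|\mu - \mu'\|_{TV} \|\nu - \nu'\|_{TV},$$

where $\mu \otimes \nu$ is a product measure on $\mathcal{X} \times \mathcal{Y}$. □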
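The tensorization inequality used in this proof can be sanity-checked numerically on randomly generated finite distributions. The sketch below is only such a check, not a proof; the sizes of the two spaces, the random seed, and the function names are arbitrary choices made here.

```python
import random


def tv(p, q):
    """Total variation distance between two probability vectors."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def product_measure(p, q):
    """Product of two probability vectors, flattened in row-major order."""
    return [a * b for a in p for b in q]


def random_distribution(k, rng):
    """A random strictly positive probability vector of length k."""
    weights = [rng.random() + 1e-9 for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


def check_tensorization(trials=1000, seed=0):
    """Return True if the tensorization bound held in every random trial."""
    rng = random.Random(seed)
    for _ in range(trials):
        mu, mu2 = random_distribution(3, rng), random_distribution(3, rng)
        nu, nu2 = random_distribution(4, rng), random_distribution(4, rng)
        a, b = tv(mu, mu2), tv(nu, nu2)
        lhs = tv(product_measure(mu, nu), product_measure(mu2, nu2))
        if lhs > a + b - a * b + 1e-12:   # small slack for rounding error
            return False
    return True


if __name__ == "__main__":
    print(check_tensorization())
```

The right-hand side equals $1 - (1-a)(1-b)$, which makes the role of the two factors in (35) symmetric.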



5.3 Concentration of adaptive Markov chains 

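Throughout our calculations, $n$ will be a fixed positive integer. Let $\Gamma$ be an index set and $\{K_\gamma(\cdot \mid \cdot)\}_{\gamma \in \Gamma}$ be a collection of Markov transition kernels, $K_\gamma : \Omega \to \Omega$. For a given sequence $\gamma_{1:n-1} \in \Gamma^{n-1}$, we have a Markov measure on $\Omega^n$:

$$\mu_{\gamma_{1:n-1}}(x) = p_0(x_1) \prod_{i=1}^{n-1} K_{\gamma_i}(x_{i+1} \mid x_i), \qquad x \in \Omega^n.$$

For $1 \le i < n$, $x \in \Omega$, and $\gamma' \in \Gamma$, let $g_i(\cdot \mid x, \gamma')$ be a probability measure on $\Gamma$. Together, $\{K_\gamma\}$ and $\{g_i\}$ define a measure $\mu$ on $\Omega^n$, which we call an adaptive Markov measure:

$$\mu(x) = \sum_{\gamma_{1:n}} p_0(x_1, \gamma_1) \prod_{i=1}^{n-1} g_i(\gamma_{i+1} \mid x_i, \gamma_i)\, K_{\gamma_i}(x_{i+1} \mid x_i), \tag{37}$$

where $\gamma_1$ is a dummy index to make $g_1(\cdot \mid \cdot)$ well-defined.

Define the following contraction coefficients:

$$\kappa = \max_{w, w' \in \Omega,\ \gamma, \gamma' \in \Gamma} \left\| K_\gamma(\cdot \mid w) - K_{\gamma'}(\cdot \mid w') \right\|_{TV} \tag{38}$$

and

$$\lambda_i = \max_{w, w' \in \Omega,\ \gamma, \gamma' \in \Gamma} \left\| g_i(\cdot \mid w, \gamma) - g_i(\cdot \mid w', \gamma') \right\|_{TV}. \tag{39}$$

Theorem 5.3. For an adaptive Markov measure $\mu$ on $\Omega^n$ we have

$$\bar\eta_{ij} \le \theta_i \theta_{i+1} \cdots \theta_{j-1},$$

where

$$\theta_i \le \kappa + \lambda_i - \kappa \lambda_i.$$

Proof. First we observe that the adaptive Markov process defined in (37) is in fact an MMC, with $\Omega_o = \Omega$ and $\Omega_h = \Gamma$; thus Theorem 5.1 applies. Furthermore, the MMC kernel (denoted here by $P_i$ instead of $K_i$ to avoid confusion with $K_\gamma$) decomposes as in (35):

$$P_i\big((x, \gamma) \mid (x', \gamma')\big) = g_i(\gamma \mid x', \gamma')\, K_{\gamma'}(x \mid x'). \tag{40}$$

The claim is proved by applying Lemma 5.2 to (40). □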



5.4 Example: Application to Adaptive MCMC Analysis 

As one application, we consider the analysis of a family of adaptive Markov chain Monte Carlo schemes. Such schemes have been considered in some detail by a number of authors, beginning with studies of a specific scheme (Haario et al., 2001; Andrieu and Robert, 2001) and later being generalized by Atchadé and Rosenthal (2005), Andrieu and Moulines (2006), and Roberts and Rosenthal (2007). Many other approaches to adaptation have been developed; these include, for example, Doucet et al. (2000) and Brockwell and Kadane (2005). We consider the general framework of Roberts and Rosenthal (2007), and demonstrate that by imposing a stronger form of their so-called "diminishing adaptation" condition, one is able to strengthen the weak law of large numbers they establish to a strong law of large numbers.

Consider a stochastic process $\{X_t \in \mathbb{R},\ t = 0, 1, \ldots\}$ on $(\mathbb{R}^{\mathbb{N}}, \mathcal{F}^{\mathbb{N}}, P)$. Adopting similar notation to that of Roberts and Rosenthal (2007), define a family $\{K_\gamma(\cdot, \cdot),\ \gamma \in \mathcal{G} \subset \mathbb{R}\}$ of transition kernels such that for each $\gamma \in \mathcal{G}$, $K_\gamma(\cdot, \cdot)$ is irreducible, aperiodic, and ergodic with a limiting distribution $\pi$ on $(\mathbb{R}, \mathcal{F})$. One would typically take each such kernel to be a Metropolis-Hastings kernel with certain parameter values determined by $\gamma$.

For fixed $\gamma$, a homogeneous Markov chain whose joint distributions are determined by an initial value and transition kernel $K_\gamma$ would have marginal distributions converging in total variation norm to $\pi$. However, in adaptive MCMC problems, interest centers on the behaviour of a more complex process $\{X_t\}$. Rather than holding $\gamma$ fixed, one allows the transition kernel to vary over time. To be precise, we specify an initial value $X_0 = x_0$, along with transition probabilities

$$P(X_{t+1} \in A \mid X_t = x) = K_{\Gamma_t}(x, A),$$

where each $\Gamma_t \in \mathcal{G}$ is some function of $\Gamma_0, \ldots, \Gamma_{t-1}, X_0, \ldots, X_t$. Thus the kernel used at time $t$ is itself random, depending on the past history of the process. This means that $\{X_t\}$ is not Markovian.



To see how we can apply Theorem 4.1 in this context, we introduce some assumptions.

Assumption 5.4. There exists a random limiting kernel indexed by $\Gamma_\infty(\omega)$ such that

$$\Gamma_t(\omega) \to \Gamma_\infty(\omega) \qquad \forall \omega \in \Omega. \tag{41}$$

Furthermore, convergence to the random limit is uniform in the sense that there exists a non-negative monotone non-increasing sequence $\{\kappa_t\}$ such that

$$|\Gamma_t(\omega) - \Gamma_\infty(\omega)| < \kappa_t \qquad \forall \omega \in \Omega. \tag{42}$$

Assumption 5.5. For each $\gamma \in \mathcal{G}$, we have $K_\gamma(x, \cdot) \Rightarrow K_\gamma(y, \cdot)$ (i.e., converges in distribution) whenever $x \to y$.

Assumption 5.6. Each kernel $K_\gamma(\cdot, \cdot)$, $\gamma \in \mathcal{G}$, is uniformly ergodic, with

$$\lim_{k \to \infty} \left\| K_\gamma^k(x, \cdot) - \pi(\cdot) \right\|_{TV} = 0, \tag{43}$$

and satisfies the minorization condition

$$K_\gamma(x, \cdot) \ge m_0\, \zeta(\cdot), \qquad \forall x \in \mathbb{R}, \tag{44}$$

where $\zeta(\cdot)$ is a probability measure on $(\mathbb{R}, \mathcal{F})$ and $m_0$ is some positive constant.



Assumptions 5.4 and 5.6 are not restrictive. The first can be satisfied by requiring that $|\Gamma_{t+1} - \Gamma_t| = o(t^{-\alpha})$ for some $\alpha > 1$. Intuitively, this is just a form of what Roberts and Rosenthal (2007) refer to as "diminishing adaptation". Assumption 5.6 ensures that all possible adaptive kernels mix at a minimal rate, and that any composition of kernels also mixes at that rate. The latter follows from the well-known bound on the contraction coefficient $\theta$ via the minorization constant $m_0$ in (44):

$$\theta \le 1 - m_0;$$

see, for example, Lemma 2.2.3 in Kontorovich (2007). One way to construct such a family of kernels is to choose a family of Metropolis-Hastings kernels in which the proposal distributions all share a common component which does not depend on the current state.

Assumption 5.5 is a natural Feller-type condition; in particular, it is satisfied by most Metropolis-Hastings chains, and is fairly easily checked in practice.
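The relation $\theta \le 1 - m_0$ between the contraction coefficient and the minorization constant can be illustrated in a finite-state analogue of (44). The sketch below assumes a row-stochastic kernel on a finite state space (the setting of this section is $\mathbb{R}$, so this is only an analogy); the kernel and function names are hypothetical.

```python
def tv(p, q):
    """Total variation distance between two probability vectors."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def contraction(kernel):
    """Largest total variation distance between two rows of the kernel."""
    return max(tv(r, s) for r in kernel for s in kernel)


def minorization(kernel):
    """Largest m0 and matching zeta with K(x, .) >= m0 * zeta(.) for all x.

    In the finite case m0 is the sum of the column minima and zeta is the
    normalized vector of column minima (None if m0 is zero).
    """
    n_cols = len(kernel[0])
    col_mins = [min(row[y] for row in kernel) for y in range(n_cols)]
    m0 = sum(col_mins)
    zeta = [c / m0 for c in col_mins] if m0 > 0 else None
    return m0, zeta


if __name__ == "__main__":
    K = [[0.5, 0.3, 0.2],
         [0.2, 0.5, 0.3],
         [0.3, 0.2, 0.5]]   # hypothetical kernel
    m0, zeta = minorization(K)
    print(contraction(K), 1.0 - m0, zeta)
```

A shared state-independent component in the proposal, as suggested above, is what keeps the column minima (and hence $m_0$) bounded away from zero across the whole family of kernels.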

Using our main result along with these assumptions, we are in a position to state conditions under which Theorem 23 of Roberts and Rosenthal (2007) can be strengthened to establish strong instead of weak convergence.

Theorem 5.7. Suppose that an adaptive Markov chain satisfies Assumptions 5.4, 5.5, and 5.6. Then we have

$$\hat P_n(A) \to \pi(A), \quad \text{a.s.},$$

for each $A \in \mathcal{F}$, where $\hat P_n(\cdot)$ is the empirical measure defined in (31).

Proof. The minorization condition in Assumption 5.6 ensures that the simultaneous uniform ergodicity condition of Theorem 5 of Roberts and Rosenthal (2007) holds. Assumption 5.4 ensures that the diminishing adaptation condition of the same theorem is also satisfied. Then we have

$$\left| \mathbb{E}\hat P_n(A) - \pi(A) \right| \le \varepsilon + P\left\{ |\hat P_n(A) - \pi(A)| > \varepsilon \right\} \tag{45}$$

and the latter probability is bounded by a quantity independent of $A$, which goes to zero, by Theorem 23 of Roberts and Rosenthal (2007) (the uniformity is not stated explicitly in the Theorem but is established in the proof that they give). This establishes the uniformly converging expectation condition of our Theorem 4.1. The minorization condition (44) ensures that the mixing coefficients $\bar\eta_{ij}$ decay as $(1 - m_0)^{j-i}$. To establish almost-sure convergence, we are going to argue along the lines of Theorem 4.1 (the latter is not applicable directly, since the adaptive Markov chain measure $P$ might not have a density with respect to the Lebesgue measure). Let $n_0 = n_0(\varepsilon)$ be such that $|\mathbb{E}\hat P_n(A) - \pi(A)| < \varepsilon$ for all $n > n_0(\varepsilon)$. As in Theorem 4.1, the quantity we want to bound is

$$P\left\{ |\hat P_n(A) - \pi(A)| > t + \varepsilon \right\} \tag{46}$$

for $\varepsilon, t > 0$ and $n > n_0(\varepsilon)$.

We now approximate the chain $P$ by a finite-state chain $P'$ induced by finite partitions, as shown in the Appendix. It follows from the arguments in the Appendix that for any $E \in \mathcal{F}^n$ we can find a partition of the state space and a finite-state chain $P'$ on $(\Omega', \mathcal{F}')^n$ so that $P'(E')$ is arbitrarily close to $P(E)$ for some $E' \in (\mathcal{F}')^n$. Therefore, $\hat P_n(A)$ and $\pi(A)$ can be made arbitrarily close to their finite-state analogues $\hat P'_n(A')$ and $\pi'(A')$ and in particular,

$$P'\left\{ |\hat P'_n(A') - \pi'(A')| > t + \varepsilon \right\} \tag{47}$$

approximates the expression in (46). Furthermore, Lemma A.1 shows that for a sufficiently refined partition, the finite-state chain $P'$ will have mixing coefficients arbitrarily close to those of $P$. To conclude the proof it suffices to apply Theorem 4.1 and Corollary 4.2. □
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A Extending finite-state inequalities to more general spaces 

In bounding the mixing coefficients, measure-theoretic technicalities tend to play a peripheral role.
Indeed, the $\Omega = \{0, 1\}$ case already captures most of the proof complexity. The mixing results we
proved for finite $\Omega$ extend verbatim to $\Omega = \mathbb{N}$, and under mild continuity assumptions, to much
more general measure spaces.

The inequality in Theorem 3.6 applies to a broad class of state spaces, including $\Omega = \mathbb{R}$. The latter
result establishes that the $\eta$-mixing coefficients of a random process control the concentration of
Lipschitz path functionals about their means. This immediately implied the strong law of large
numbers in Section 4.

In Theorem 5.3 we showed how to control the $\bar\eta_{ij}$ in terms of the Markov contraction coefficients;
this was done for finite state spaces. In this Appendix, we extend these bounds to the continuous
case.



A.1 Markov marginal chains

In the continuous case we shall consider the "observed" state space $\Omega_o$ and the "hidden" state space
$\Omega_h$, with the total space $W = \Omega_o \times \Omega_h$; one may take $\Omega_o = \Omega_h = \mathbb{R}$. We take the usual Borel
$\sigma$-algebra on $W$, which is a product of the corresponding $\sigma$-algebras on $\Omega_o$ and $\Omega_h$. An MMC over
$\Omega_o$ is obtained by first defining the Markov process $W_t = (X_t, Y_t)$, with values in $W$, induced by
the kernels $\{K_i\}_{0 \le i < n}$. Thus if $A_i \subset W$, $1 \le i \le n$, are measurable, then

\[ P\{W_1 \in A_1, \ldots, W_n \in A_n\} = \int_{A_1} \cdots \int_{A_n} \prod_{i=1}^{n} K_{i-1}(w_{i-1}, dw_i). \]



We will write $K_0$ for the initial distribution of the MMC, and expressions such as $K_0(x_0, dx_1)$ are
to be interpreted as $K_0(dx_1)$. Note that we are not assuming anything about the density of $P$; the
latter may well not exist with respect to the Lebesgue (or any other product) measure on $W^n$.

As in the discrete case, the MMC is obtained by marginalizing out the "hidden" component $Y$:

\[ P\{X_1 \in B_1, \ldots, X_n \in B_n\} = \int_{\Omega_h^n} \int_{B_1} \cdots \int_{B_n} \prod_{i=1}^{n} K_{i-1}((x_{i-1}, y_{i-1}), (dx_i, dy_i)) \]

for measurable sets $B_i \subset \Omega_o$.

The definition of the contraction coefficient readily generalizes to the continuous case:

\[ \theta_i = \sup_{w, w' \in W} \| K_i(w, \cdot) - K_i(w', \cdot) \|_{TV}. \tag{48} \]

We will use the notation $\theta_i(P)$ if we wish to make explicit the dependence on the particular Markov
measure.
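For a finite state space the supremum in (48) reduces to a maximum over pairs of rows of the transition matrix. The following is a minimal sketch of that computation, assuming that the total variation norm is normalized as half the $\ell_1$ distance (so that it takes values in [0, 1]); the function names and the example matrix are illustrative and do not come from the paper.

```python
from itertools import combinations


def tv_distance(p, q):
    """Total variation distance between two probability vectors.

    Assumed normalization: half the l1 distance, so the value lies in [0, 1].
    """
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def contraction_coefficient(kernel):
    """Largest TV distance between any two rows of a row-stochastic matrix."""
    return max(
        (tv_distance(r, s) for r, s in combinations(kernel, 2)),
        default=0.0,  # a single-row kernel has no pair of rows to compare
    )


# Hypothetical 3-state kernel; row w is the distribution K(w, .).
kernel = [
    [0.5, 0.5, 0.0],
    [0.25, 0.5, 0.25],
    [0.0, 0.5, 0.5],
]
theta = contraction_coefficient(kernel)
print(theta)
```

In this example the first and last rows are the farthest apart, and they determine the coefficient.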



Similarly, we define

\[ \bar\eta_{ij} = \sup_{y, w, w'} \eta_{ij}(y, w, w'), \]

and write $\bar\eta_{ij}(P)$ to make explicit the particular measure.

All sets below are assumed to be Borel-measurable. Recall that a partition of a set $E$ is a collection of
disjoint sets whose union is $E$. Whenever $\mathcal{A}$ is a collection of subsets of $E$, $\mathcal{A}$ induces an equivalence
relation on $E$ as follows: for $x, y \in E$,

\[ x \equiv_{\mathcal{A}} y \iff 1_{\{x \in A\}} = 1_{\{y \in A\}} \text{ for all } A \in \mathcal{A}. \]

Such an equivalence relation in turn induces a partition on $E$ (whose members are the equivalence
classes); we shall call such a partition the refining partition of $\mathcal{A}$. Notice that if $\mathcal{A}$ is finite then so
is its refining partition.
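The refining partition can be illustrated on a finite ground set: two points are equivalent exactly when they belong to the same members of the collection, so grouping points by their membership pattern yields the partition. The sketch below is illustrative only; the function name and the example sets are assumptions made for the demonstration.

```python
def refining_partition(ground_set, collection):
    """Group the points of ground_set by their membership pattern in collection."""
    blocks = {}
    for x in ground_set:
        # Points with the same signature lie in exactly the same sets.
        signature = tuple(x in a for a in collection)
        blocks.setdefault(signature, set()).add(x)
    # Order the blocks by their smallest element for a reproducible result.
    return sorted(blocks.values(), key=min)


E = range(1, 7)
A = [{1, 2, 3}, {3, 4}]
partition = refining_partition(E, A)
print(partition)
```

Since each block corresponds to one membership pattern, a collection of $m$ sets yields at most $2^m$ blocks, which is why a finite collection has a finite refining partition.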

Let $P$ be a Markov chain on $W^n$, as above. Any finite partition $\{V_k : 1 \le k \le m\}$ of $W$ induces a
Markov chain $\tilde P$ with kernels $\{\tilde K_i\}_{0 \le i < n}$ over the finite state space $\tilde W = \{1, \ldots, m\}$, as follows:



If $W = \Omega_o \times \Omega_h$, $\{S_k\}$ is a partition of $\Omega_o$, and $\{T_l\}$ is a partition of $\Omega_h$, then these two partitions
induce a partition on $W$ in the obvious way. We will write $\tilde W = \tilde\Omega_o \times \tilde\Omega_h$ to denote the state spaces
obtained by identifying the partition blocks with states.

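Lemma A.1. Let $P$ be a Markov chain on $W^n = (\Omega_o \times \Omega_h)^n$ and $Q$ be the induced MMC on $\Omega_o^n$.
Then, assuming that the kernel generating $P$ satisfies the Feller continuity condition in Assumption 5.5,
we have that for any $\varepsilon > 0$ there are finite partitions of $\Omega_o$ and $\Omega_h$ such that the induced
Markov chain $\tilde P$ on $\tilde W^n = (\tilde\Omega_o \times \tilde\Omega_h)^n$ and MMC $\tilde Q$ on $\tilde\Omega_o^n$ satisfy

\[ |\theta_i(\tilde P) - \theta_i(P)| < \varepsilon \tag{49} \]

and

\[ |\bar\eta_{ij}(\tilde Q) - \bar\eta_{ij}(Q)| < \varepsilon \tag{50} \]

for $1 \le i < j \le n$.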



Proof. Fix an $\varepsilon > 0$. Let us construct the requisite partition for (49). Fix an $1 \le i < n$ and let
$t = \theta_i(P)$. Then by definition of $\|\cdot\|_{TV}$ there are $w$ and $w'$ in $W$, and an $E \subset W$ such that

\[ t - \varepsilon/2 < K_i(w, E) - K_i(w', E) \le t. \]

Furthermore, $E$ may be approximated by a finite union of product sets $A_k \times B_l$, with $A_k \subset \Omega_o$,
$k = 1, \ldots, m_A$, and $B_l \subset \Omega_h$, $l = 1, \ldots, m_B$:

\[ \tilde E = \bigcup_{k=1,\ldots,m_A} \ \bigcup_{l=1,\ldots,m_B} A_k \times B_l, \tag{51} \]

so that

\[ t - \varepsilon < K_i(w, \tilde E) - K_i(w', \tilde E) \le t. \]

By Feller continuity, we have

\[ K_i(w, \tilde E) = \lim_{a \to \infty} P(W_{i+1} \in \tilde E \mid W_i \in U_a), \]

where $U_1 \supset U_2 \supset \cdots \ni w$ is a sequence of neighborhoods shrinking to $w$ in the sense that $\bigcap_a U_a = \{w\}$.
(We are assuming without loss of generality that $P(U_a) > 0$ for all $a$.)

Therefore, taking $w = (x, y)$ and $w' = (x', y')$, we have that there are neighborhoods $C, C' \subset \Omega_o$ of
$x$ and $x'$ (respectively), as well as analogous neighborhoods $D, D' \subset \Omega_h$ of $y$ and $y'$, such that

\[ P(W_{i+1} \in \tilde E \mid W_i \in C \times D) - P(W_{i+1} \in \tilde E \mid W_i \in C' \times D') \]

is arbitrarily close to $K_i(w, \tilde E) - K_i(w', \tilde E)$.

Let $\{S_a\}$ be the partition of $\Omega_o$ refining the sets $\{A_k\}$, $C$, and $C'$. Similarly, let $\{T_b\}$ be the partition
of $\Omega_h$ refining the sets $\{B_l\}$, $D$, and $D'$. Identify the partition blocks with states and induce the
finite-state Markov chain $\tilde P$ on $\tilde W = \tilde\Omega_o \times \tilde\Omega_h$. By the approximation argument above, there exist
$C, C', D, D'$ such that $\tilde P$ satisfies (49) for the given $i$. Repeating this process for each $i = 1, \ldots, n - 1$
and picking a finite partition of $\Omega_o$ (respectively, $\Omega_h$) that simultaneously refines the partitions for
each $i$, we establish (49) for all $i$.

Now we turn to (50). The basic technique is the same as the one used to show (49). Fix $1 \le i < j \le n$
and let $h = \bar\eta_{ij}(Q)$. Then there are $x_{1:i-1} \in \Omega_o^{i-1}$, $x_i, x_i' \in \Omega_o$ and an $A \subset \Omega_o^{n-j+1}$ such that

\[ h - \varepsilon/2 < Q\{X_{j:n} \in A \mid X_{1:i} = x_{1:i}\} - Q\{X_{j:n} \in A \mid X_{1:i} = x_{1:i-1} x_i'\} \le h. \]

As done with $E$ and $\tilde E$ above, $Q\{X_{j:n} \in A \mid X_{1:i} = x_{1:i}\}$ may be approximated by
$Q\{X_{j:n} \in \tilde A \mid X_{1:i} = x_{1:i}\}$, where $\tilde A$ is now an $(n - j + 1)$-fold product of expressions of the type
$\bigcup_k A_k$, with $A_k \subset \Omega_o$.

Again as above, we take small neighborhoods around $x_1, x_2, \ldots, x_i, x_i'$ to obtain a partition which
satisfies (50) for the given $i, j$. Refining such partitions simultaneously for $1 \le i < j \le n$, we
establish (50) for all these values.

Finally, the partitions obtained in the course of proving (49) and (50) can once again be refined to
make the two inequalities hold simultaneously. □

Corollary A.2. Let $P$ be a Markov chain on $W^n = (\Omega_o \times \Omega_h)^n$ and $Q$ be the induced MMC on
$\Omega_o^n$. Then

\[ \bar\eta_{ij}(Q) \le \prod_{k=i}^{j-1} \theta_k(P). \]

Proof. Immediate consequence of the corresponding claim for finite $\Omega_o$ and $\Omega_h$. □

A.2 Tensorization

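Lemma A.3. Suppose the Markov chain $P$ on $(\Omega_o \times \Omega_h)^n$ is defined by transition kernels $\{K_i\}_{0 \le i < n}$,
which have the following special structure:

\[ K_i((x', y'), (x, y)) = A_i(x \mid (x', y')) \, B_i(y \mid (x', y')). \tag{52} \]

Then we have

\[ \theta_i \le \alpha_i + \beta_i - \alpha_i \beta_i, \tag{53} \]

where

\[ \alpha_i = \sup \| A_i(\cdot \mid (x', y')) - A_i(\cdot \mid (x'', y'')) \|_{TV}, \qquad
   \beta_i = \sup \| B_i(\cdot \mid (x', y')) - B_i(\cdot \mid (x'', y'')) \|_{TV}, \]

with both suprema taken over $(x', y'), (x'', y'') \in \Omega_o \times \Omega_h$.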

Proof. The same technique of approximating $P$ by a finite-state Markov chain $\tilde P$ as employed in
Lemma A.1 may be used here; details are omitted. □
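The bound (53) can be checked numerically on small finite state spaces. The sketch below builds a kernel with the product structure (52) from randomly generated conditional distributions and compares its contraction coefficient with $\alpha + \beta - \alpha\beta$. It is an illustration rather than a proof: the state-space sizes, the random seed, the helper names, and the normalization of total variation as half the $\ell_1$ distance are all assumptions made for this example.

```python
import itertools
import random


def tv(p, q):
    """Total variation distance (assumed normalization: half the l1 distance)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def max_pairwise_tv(rows):
    """Largest TV distance between any two of the given distributions."""
    return max(tv(r, s) for r, s in itertools.combinations(rows, 2))


def random_distribution(size, rng):
    """A random probability vector with strictly positive entries."""
    weights = [rng.random() + 1e-3 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)
n_obs, n_hid = 2, 3  # hypothetical sizes of the observed and hidden spaces
states = list(itertools.product(range(n_obs), range(n_hid)))

# A[s] is a distribution over observed states, B[s] over hidden states,
# both conditioned on the previous joint state s = (x', y').
A = {s: random_distribution(n_obs, rng) for s in states}
B = {s: random_distribution(n_hid, rng) for s in states}

# Joint kernel with the product structure of (52).
K = {s: [A[s][x] * B[s][y] for x, y in states] for s in states}

alpha = max_pairwise_tv(list(A.values()))
beta = max_pairwise_tv(list(B.values()))
theta = max_pairwise_tv(list(K.values()))
bound = alpha + beta - alpha * beta
print(theta, bound)
```

The right-hand side equals $1 - (1 - \alpha)(1 - \beta)$, so the bound is nontrivial exactly when both component coefficients are strictly below 1.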



A.3 Adaptive Markov chains

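Let $\Gamma$ be an index set and $\{K_\gamma\}_{\gamma \in \Gamma}$ be a collection of Markov transition kernels, $K_\gamma : \Omega \to \Omega$. For
a given sequence $\gamma_{1:n-1} \in \Gamma^{n-1}$, we have a Markov measure on $\Omega^n$:

\[ dP_{\gamma_{1:n-1}}(x) = \prod_{i=0}^{n-1} K_{\gamma_i}(x_i, dx_{i+1}), \qquad x \in \Omega^n. \]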



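For $1 \le i < n$, $x \in \Omega$ and $\gamma' \in \Gamma$, let $g_i(\cdot \mid x, \gamma')$ be a probability measure on $\Gamma$. Together, $\{K_\gamma\}$
and $\{g_i\}$ define a measure $Q$ on $\Omega^n$, which we'll call an adaptive Markov measure:

\[ dQ(x) = \int_{\Gamma^{n-1}} \prod_{i=1}^{n-1} \left[ g_i(d\gamma_{i+1} \mid x_i, \gamma_i) \, K_{\gamma_{i+1}}(x_i, dx_{i+1}) \right], \tag{54} \]

where $\gamma_1$ is a dummy index to make $g_1(\cdot \mid \cdot)$ well-defined.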
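Define the following contraction coefficients:

\[ \kappa = \sup_{\gamma \in \Gamma, \ w, w' \in \Omega} \| K_\gamma(w, \cdot) - K_\gamma(w', \cdot) \|_{TV} \tag{55} \]

and

\[ \lambda_i = \sup_{w, w' \in \Omega, \ \gamma, \gamma' \in \Gamma} \| g_i(\cdot \mid w, \gamma) - g_i(\cdot \mid w', \gamma') \|_{TV}. \tag{56} \]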



Then we have the following bound.

Theorem A.4. For an adaptive Markov measure $Q$ on $\Omega^n$ we have

\[ \bar\eta_{ij} \le \theta_i \theta_{i+1} \cdots \theta_{j-1}, \]

where

\[ \theta_i = \kappa + \lambda_i - \kappa \lambda_i. \]

Proof. First we observe that the adaptive Markov process defined in (54) is in fact an MMC, with
$\Omega_o = \Omega$ and $\Omega_h = \Gamma$; thus Corollary A.2 applies. Furthermore, the MMC kernel (denoted here by
$P_i$ instead of $K_i$ to avoid confusion with $K_\gamma$) decomposes as in (52):

\[ P_i((x', \gamma'), (dx, d\gamma)) = g_i(d\gamma \mid x', \gamma') \, K_\gamma(x', dx). \tag{57} \]

The claim is proved by applying Lemma A.3 to (57). □
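To see how the bound of Theorem A.4 behaves numerically, the following minimal sketch evaluates $\theta_i = \kappa + \lambda_i - \kappa\lambda_i$ and the product $\theta_i \theta_{i+1} \cdots \theta_{j-1}$. The numerical values of $\kappa$ and $\lambda_i$ are hypothetical, and the function names are illustrative only.

```python
from math import prod


def theta(kappa, lam):
    """Combined contraction coefficient kappa + lambda - kappa * lambda."""
    return kappa + lam - kappa * lam


def eta_bound(kappa, lambdas, i, j):
    """Upper bound theta_i * theta_{i+1} * ... * theta_{j-1} on eta_ij.

    lambdas maps the (1-based) index k to lambda_k.
    """
    return prod(theta(kappa, lambdas[k]) for k in range(i, j))


# Hypothetical coefficients: kappa = 0.5 and lambda_k = 0.5 for every k.
kappa = 0.5
lambdas = {k: 0.5 for k in range(1, 6)}
bound = eta_bound(kappa, lambdas, 1, 4)
print(bound)
```

Because $1 - \theta_i = (1 - \kappa)(1 - \lambda_i)$, the product decays geometrically in $j - i$ whenever $\kappa < 1$ and the $\lambda_i$ are bounded away from 1; if either coefficient equals 1, the corresponding factor is 1 and contributes no decay.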